Vocabulary independent spoken query: a case for subword units

نویسندگان

  • Evandro B. Gouvêa
  • Tony Ezzat
چکیده

In this work, we describe a subword unit approach for information retrieval of items by voice. An algorithm based on the minimum description length (MDL) principle converts an index written in terms of words into an index written in terms of phonetic subword units. A speech recognition engine that uses a language model and pronounciation dictionary built from such an inventory of subword units is completely independent from the information retrieval task. The recognition engine can remain fixed, making this approach ideal for resource constrained systems. In addition, we demonstrate that recall results at higher out of vocabulary (OOV) rates are much superior for the subword unit system. On a music lyrics task at 80% OOV, the subword-based recall is 75.2%, compared to 47.4% for a word system.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

An STD system for OOV query terms using various subword units

We have been proposing a Spoken Term Detection (STD) method for Out-Of-Vocabulary (OOV) query terms using various subword units, such as monophone, triphone, demiphone, one third phone, and Sub-phonetic segment (SPS) models. In the proposed method, subword-based ASR is performed for all spoken documents and subword recognition results are generated using subword acoustic models and subword lang...

متن کامل

An STD System for OOV Query Terms Integrating Multiple STD Results of Various Subword units

We have been proposing a Spoken Term Detection (STD) method for Out-Of-Vocabulary (OOV) query terms integrating various subword recognition results using monophone, triphone, demiphone, one third phone, and Sub-phonetic segment (SPS) models. In the proposed method, subword-based ASR (Automatic Speech Recognition) is performed for all spoken documents and subword recognition results are generate...

متن کامل

Multilayer subword units for open-vocabulary spoken document retrieval

This paper describes the application of subword units in an effort of improving open-vocabulary spoken document retrieval performance in the case of highly corrupted recognition output. This paper presents the developed open-vocabulary spoken document retrieval system including the newly proposed subphonetic segment unit and combining multilayer subword units. Our experiments on Japanese spoken...

متن کامل

An Investigation of Subword Unit Representations for Spoken Document Retrieval

This study investigates the feasibility of using subword unit representations for spoken document retrieval as an alternative to using words generated by either keyword spotting or word recognition. Our investigation is motivated by the observation that word-based retrieval approaches face the problem of either having to know the keywords to search for a priori, or requiring a very large recogn...

متن کامل

Hybrid word-subword decoding for spoken term detection

This paper deals with a hybrid word-subword recognition system for spoken term detection. The decoding is driven by a hybrid recognition network and the decoder directly produces hybrid word-subword lattices. One phone and two multigram models were tested to represent sub-word units. The systems were evaluated in terms of spoken term detection accuracy and the size of index. We concluded that t...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2010